Abstract


An important part of the legacy of Evarist Giné is his fundamental contributions
to our understanding of U-statistics and U-processes. In this paper we discuss the
estimation of the mean of multivariate functions in the case of possibly heavy-tailed
distributions. In such situations, reliable estimates of the mean cannot be obtained by
the usual U-statistics. We introduce a new estimator, based on the so-called median-of-means
technique. We develop performance bounds for this new estimator that generalize an
estimate of Arcones and Giné (1993), showing that the new estimator performs, under
minimal moment conditions, as well as classical U-statistics for bounded random
variables. We discuss an application of this estimator to clustering.


1 Introduction 


Motivated by numerous applications, the theory of U-statistics and U-processes has
received considerable attention in the past decades. U-statistics appear naturally in ranking
(Clémençon et al., 2008), clustering (Clémençon, 2014) and learning on graphs (Biau and
Bleakley, 2006) or as components of higher-order terms in expansions of smooth statistics,
see, for example, Robins et al. (2009). The general setting may be described as follows. Let X be
a random variable taking values in some measurable space $\mathcal{X}$ and let
$h : \mathcal{X}^m \to \mathbb{R}$ be a measurable function of $m \ge 2$ variables. Let P be
the probability measure of X. Suppose we have access to $n \ge m$ independent random variables
$X_1,\dots,X_n$, all distributed as X. We define the U-statistic of order m and kernel h based
on the sequence $\{X_i\}$ as

$$U_n(h) = \frac{(n-m)!}{n!} \sum_{(i_1,\dots,i_m) \in I_n^m} h(X_{i_1},\dots,X_{i_m}) , \tag{1}$$

where

$$I_n^m = \{(i_1,\dots,i_m) : 1 \le i_j \le n,\ i_j \ne i_k \text{ if } j \ne k\}$$

is the set of all m-tuples of different integers between 1 and n. U-statistics are unbiased
estimators of the mean $m_h = \mathbb{E}\, h(X_1,\dots,X_m)$ and have minimal variance among all
unbiased estimators (Hoeffding, 1948). Understanding the concentration of a U-statistic around its
expected value has been the subject of extensive study. de la Peña and Giné (1999) provide
an excellent summary but see also Giné et al. (2000) for a more recent development.
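To make definition (1) concrete, the following is a minimal Python sketch that averages a kernel over all ordered m-tuples of distinct indices. The function name `u_statistic` and the toy sample are illustrative assumptions, not part of the paper, and the brute-force enumeration is only practical for small n.

```python
from itertools import permutations
from math import factorial


def u_statistic(h, xs, m):
    """Average of h over all m-tuples of distinct indices, as in definition (1)."""
    n = len(xs)
    if n < m:
        raise ValueError("need at least m observations")
    # Sum h over every ordered m-tuple of distinct indices.
    total = sum(h(*(xs[i] for i in idx)) for idx in permutations(range(n), m))
    # There are n!/(n-m)! such tuples, hence the normalization (n-m)!/n!.
    return total * factorial(n - m) / factorial(n)


# Hypothetical example: the product kernel h(x, y) = x * y with m = 2.
sample = [1.0, 2.0, 3.0]
value = u_statistic(lambda x, y: x * y, sample, 2)
```

For a symmetric kernel one could sum over unordered index sets instead and save a factor of m!; the ordered version is kept here because it mirrors (1) directly.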

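By a classical inequality of Hoeffding (1963), for a bounded kernel h, for all $\delta > 0$,

$$\mathbb{P}\left\{ |U_n(h) - m_h| > \|h\|_\infty \sqrt{\frac{\log(2/\delta)}{2\lfloor n/m \rfloor}} \right\} \le \delta , \tag{2}$$

and we also have the “Bernstein-type” inequality

$$\mathbb{P}\left\{ |U_n(h) - m_h| > \sqrt{\frac{4\sigma^2 \log(2/\delta)}{2\lfloor n/m \rfloor}} \vee \frac{4\|h\|_\infty \log(2/\delta)}{6\lfloor n/m \rfloor} \right\} \le \delta ,$$

where $\sigma^2 = \mathrm{Var}(h(X_1,\dots,X_m))$.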

However, under certain degeneracy assumptions on the kernel, significantly sharper
bounds have been proved. Following the exposition of de la Peña and Giné (1999), for
convenience, we restrict our attention to symmetric kernels. A kernel h is symmetric if for
all $x_1,\dots,x_m \in \mathcal{X}$ and all permutations s,

$$h(x_1,\dots,x_m) = h(x_{s_1},\dots,x_{s_m}) .$$

A symmetric kernel h is said to be P-degenerate of order $q-1$, $1 < q \le m$, if for all
$x_1,\dots,x_{q-1} \in \mathcal{X}$,

$$\int h(x_1,\dots,x_m)\, dP^{m-q+1}(x_q,\dots,x_m) = \int h(x_1,\dots,x_m)\, dP^{m}(x_1,\dots,x_m)$$

and

$$(x_1,\dots,x_q) \mapsto \int h(x_1,\dots,x_m)\, dP^{m-q}(x_{q+1},\dots,x_m)$$

is not a constant function. In the special case of $m_h = 0$ and $q = m$ (i.e., when the kernel
is $(m-1)$-degenerate), h is said to be P-canonical. P-canonical kernels appear naturally in
the Hoeffding decomposition of a U-statistic, see de la Peña and Giné (1999).


Arcones and Giné (1993) proved the following important improvement of Hoeffding’s
inequalities for canonical kernels: If $h - m_h$ is a bounded, symmetric P-canonical kernel of
m variables, there exist finite positive constants $c_1$ and $c_2$ depending only on m such that
for all $\delta \in (0,1)$,

$$\mathbb{P}\left\{ |U_n(h) - m_h| \ge c_1 \|h\|_\infty \left( \frac{\log(c_2/\delta)}{n} \right)^{m/2} \right\} \le \delta , \tag{3}$$

and also

$$\mathbb{P}\left\{ |U_n(h) - m_h| \ge \left( \frac{\sigma^2 \log(c_1/\delta)}{c_2 n} \right)^{m/2} \vee \frac{\|h\|_\infty}{\sqrt{n}} \left( \frac{\log(c_1/\delta)}{c_2} \right)^{(m+1)/2} \right\} \le \delta . \tag{4}$$

In the special case of P-canonical kernels of order $m = 2$, (3) implies that

$$|U_n(h) - m_h| \le \frac{c_1 \|h\|_\infty}{n} \log\left(\frac{c_2}{\delta}\right) \tag{5}$$

with probability at least $1 - \delta$. Note that this rate of convergence is significantly faster
than the rate implied by (2).

All the results cited above require boundedness of the kernel. If the kernel is unbounded
but $h(X_1,\dots,X_m)$ has sufficiently light (e.g., sub-Gaussian) tails, then some of these results
may be extended, see, for example, Giné et al. (2000). However, if $h(X_1,\dots,X_m)$ may
have a heavy-tailed distribution, exponential inequalities no longer hold (even in the
univariate $m = 1$ case). Nevertheless, even though U-statistics may behave erratically in
the presence of heavy tails, in this paper we show that under minimal moment conditions,
one may construct estimators of $m_h$ that satisfy exponential inequalities analogous to (2)
and (3). These are the main results of the paper. In particular, in Section 2 we introduce
a robust estimator of the mean $m_h$. Theorems 1 and 3 establish exponential inequalities
for the performance of the new estimator under minimal moment assumptions. More
precisely, Theorem 1 only requires that $h(X_1,\dots,X_m)$ has a finite variance and establishes
inequalities analogous to (3) for P-degenerate kernels. In Theorem 3 we further weaken
the conditions and only assume that there exists $1 < p \le 2$ such that $\mathbb{E}|h|^p < \infty$.

The next example illustrates why classical U-statistics fail under heavy-tailed
distributions.

Example. Consider the special case $m = 2$, $\mathbb{E} X_1 = 0$ and $h(X_1,X_2) = X_1 X_2$. Note that
this kernel is P-canonical. We define $Y_1,\dots,Y_n$ as independent copies of $X_1,\dots,X_n$. By
decoupling inequalities for the tail of U-statistics given in Theorem 3.4.1 in de la Peña and Giné
(1999) (see also Theorem 7 in the Appendix), $U_n(h)$ has a similar tail behavior to

$$\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) \left( \frac{1}{n-1} \sum_{j=1}^{n} Y_j \right) .$$

Thus, $U_n(h)$ behaves like a product of two independent empirical mean estimators of the
same distribution. When the $X_i$ are heavy tailed, the empirical mean is known to be a
poor estimator of the mean. As an example, assume that X follows an $\alpha$-stable law $S(\gamma,\alpha)$
for some $\alpha \in (1,2)$ and $\gamma > 0$. Recall that a random variable X has an $\alpha$-stable law
$S(\gamma,\alpha)$ if for all $u \in \mathbb{R}$,

$$\mathbb{E} \exp(iuX) = \exp(-\gamma^\alpha |u|^\alpha)$$

(see Zolotarev (1986), Nolan (2015)). Then it follows from the properties of $\alpha$-stable
distributions (summarized in Proposition 9 in the Appendix) that there exists a constant
$c > 0$ depending only on $\alpha$ and $\gamma$ such that

$$\mathbb{P}\left\{ U_n(h) \ge n^{2/\alpha - 2} \right\} \ge c$$

and therefore there is no hope of reproducing an upper bound like (5). Below we show how
this problem can be dealt with by replacing the U-statistic by a more robust estimator.

Our approach is based on robust mean estimators in the univariate setting. Estimation
of the mean of a possibly heavy-tailed random variable X from an i.i.d. sample $X_1,\dots,X_n$
has recently received increasing attention. Introduced by Nemirovsky and Yudin (1983),
the median-of-means estimator takes a confidence level $\delta \in (0,1)$ and divides the data into
$V \approx \log \delta^{-1}$ blocks. For each block $k = 1,\dots,V$, one may compute the empirical mean
$\hat{\mu}_k$ on the variables in the block. The median $\bar{\mu}$ of the $\hat{\mu}_k$ is the so-called median-of-means
estimator. A short analysis of the resulting estimator shows that

$$|\bar{\mu} - m_h| \le c \sqrt{\mathrm{Var}(X)} \sqrt{\frac{\log(1/\delta)}{n}}$$

with probability at least $1 - \delta$ for a numerical constant c. For the details of the proof
see Lerasle and Oliveira (2011). When the variance is infinite but a moment of order
$1 < p \le 2$ exists, the median-of-means estimator is still useful, see Bubeck et al. (2013).

This estimator has recently been studied in various contexts. M-estimation based on
this technique has been developed by Lerasle and Oliveira (2011) and generalizations in a
multivariate context have been discussed by Hsu and Sabato (2013) and Minsker (2015).
A similar idea was used in Alon et al. (2002). An interesting alternative to the median-of-means
estimator has been proposed by Catoni (2012).
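The univariate median-of-means construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `median_of_means` is hypothetical, and the number of blocks is taken to be exactly the ceiling of log(1/δ), whereas the text only fixes it up to a constant.

```python
import math
import statistics


def median_of_means(xs, delta):
    """Median of block means, using about log(1/delta) blocks."""
    n = len(xs)
    # Assumed choice of the number of blocks V; the constant 1 is illustrative.
    v = max(1, min(n, math.ceil(math.log(1.0 / delta))))
    # Interleaved blocks: their sizes differ by at most one.
    blocks = [xs[k::v] for k in range(v)]
    block_means = [sum(b) / len(b) for b in blocks]
    return statistics.median(block_means)


# A single huge outlier spoils the empirical mean but not the median of means.
data = [1.0, 1.0, 1.0, 1.0, 1000.0]
robust = median_of_means(data, delta=0.01)   # five blocks of size one
plain = sum(data) / len(data)                # ordinary empirical mean
```

Note that the estimator depends on δ through the number of blocks, so a different confidence level yields a different estimate.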

The rest of the paper is organized as follows. In Section 2 we introduce a robust
estimator of the mean $m_h$ and present performance bounds. In particular, Section 2.1
deals with the finite variance case. Section 2.2 is dedicated to the case when h has a finite p-th
moment for some $1 < p \le 2$ for P-degenerate kernels. Finally, in Section 3 we present an
application to clustering problems.


2 Robust U-estimation

In this section we introduce a “median-of-means”-style estimator of $m_h = \mathbb{E}\, h(X_1,\dots,X_m)$.
To define the estimator, one divides the data into V blocks. For any m-tuple of different
blocks, one may compute a (decoupled) U-statistic. Finally, one computes the median of
all the obtained values. The rigorous definition is as follows.

The estimator has a parameter $V \le n$, the number of blocks. A partition
$\mathcal{B} = (B_1,\dots,B_V)$ of $\{1,\dots,n\}$ is called regular if for all $k = 1,\dots,V$,

$$\left| |B_k| - \frac{n}{V} \right| \le 1 .$$

For any $B_{i_1},\dots,B_{i_m}$ in $\mathcal{B}$, we set

$$I_{B_{i_1},\dots,B_{i_m}} = \{(k_1,\dots,k_m) : k_j \in B_{i_j}\}$$

and

$$U_{B_{i_1},\dots,B_{i_m}}(h) = \frac{1}{|B_{i_1}| \cdots |B_{i_m}|} \sum_{(k_1,\dots,k_m) \in I_{B_{i_1},\dots,B_{i_m}}} h(X_{k_1},\dots,X_{k_m}) .$$

For any integer N and any vector $(a_1,\dots,a_N) \in \mathbb{R}^N$, we define the median $\mathrm{Med}(a_1,\dots,a_N)$
as any number b such that

$$|\{i \le N : a_i \le b\}| \ge \frac{N}{2} \quad \text{and} \quad |\{i \le N : a_i \ge b\}| \ge \frac{N}{2} .$$

Finally, we define the robust estimator:

$$\overline{U}_{\mathcal{B}}(h) = \mathrm{Med}\{ U_{B_{i_1},\dots,B_{i_m}}(h) : i_j \in \{1,\dots,V\},\ 1 \le i_1 < \dots < i_m \le V \} . \tag{6}$$

Note that, mostly in order to simplify notation, we only take those values of $U_{B_{i_1},\dots,B_{i_m}}(h)$
into account that correspond to distinct indices $i_1 < \dots < i_m$. Thus, each $U_{B_{i_1},\dots,B_{i_m}}(h)$
is a so-called decoupled U-statistic (see the Appendix for the definition). One may
incorporate all m-tuples (not necessarily with distinct indices) in the computation of the
median. However, this has a minor effect on the performance. Similar bounds may be
proven, though with more complicated notation.

A simpler alternative is obtained by taking only “diagonal” blocks into account. More
precisely, let $U_{B_i}(h)$ be the U-statistic calculated using the variables in block $B_i$ (as defined
in (1)). One may simply calculate the median of the V different U-statistics $U_{B_i}(h)$. This
version is easy to analyze because $|\{i \le V : U_{B_i}(h) \ge b\}|$ is a sum of independent random
variables. However, this simple version is wasteful in the sense that only a small fraction
of possible m-tuples are taken into account.
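The block construction of the robust estimator (one decoupled U-statistic per m-subset of distinct blocks, then the median) can be sketched as follows. The helper names `regular_partition`, `decoupled_u` and `robust_u` are assumptions made for this illustration, indices run from 0 rather than 1, and the interleaved partition is just one convenient regular partition.

```python
from itertools import combinations, product
import statistics


def regular_partition(n, v):
    """Split indices 0..n-1 into v blocks whose sizes differ by at most one."""
    return [list(range(k, n, v)) for k in range(v)]


def decoupled_u(h, xs, blocks):
    """Average of h over tuples taking one index from each of the given blocks."""
    values = [h(*(xs[k] for k in idx)) for idx in product(*blocks)]
    return sum(values) / len(values)


def robust_u(h, xs, m, v):
    """Median of the decoupled U-statistics over all m-subsets of distinct blocks."""
    if not m <= v <= len(xs):
        raise ValueError("need m <= v <= n")
    part = regular_partition(len(xs), v)
    return statistics.median(
        decoupled_u(h, xs, [part[i] for i in chosen])
        for chosen in combinations(range(v), m)
    )


# Hypothetical data: eight ones and one gross outlier, product kernel, m = 2.
xs = [1.0] * 8 + [1e6]
estimate = robust_u(lambda x, y: x * y, xs, m=2, v=9)
```

In this toy run only 8 of the 36 block pairs touch the outlier, so the median stays at 1.0. The "diagonal" variant would instead take the median of ordinary U-statistics computed within each block.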

In the next two sections we analyze the performance of the estimator $\overline{U}_{\mathcal{B}}(h)$.


2.1 Exponential inequalities for P-degenerate kernels with finite variance


Next we present a performance bound of the estimator $\overline{U}_{\mathcal{B}}(h)$ in the case when $\sigma^2$ is finite.
The somewhat more complicated case of infinite second moment is treated in Section 2.2.


Theorem 1. Let $X_1,\dots,X_n$ be i.i.d. random variables taking values in $\mathcal{X}$. Let
$h : \mathcal{X}^m \to \mathbb{R}$ be a symmetric kernel that is P-degenerate of order $q - 1$. Assume
$\mathrm{Var}(h(X_1,\dots,X_m)) = \sigma^2 < \infty$. Let $\delta \in (0,\tfrac12)$ be such that
$\lceil \log(1/\delta) \rceil \le \frac{n}{64m}$. Let $\mathcal{B}$ be a regular partition of
$\{1,\dots,n\}$ with $|\mathcal{B}| = 32m \lceil \log(1/\delta) \rceil$. Then, with probability at least $1 - 2\delta$, we have

$$|\overline{U}_{\mathcal{B}}(h) - m_h| \le K_m \sigma \left( \frac{\lceil \log(1/\delta) \rceil}{n} \right)^{q/2} , \tag{7}$$

where $K_m = 2^{\frac{7}{2}m+1} m^{\frac{m}{2}}$.
When $q = m$, the kernel $h - m_h$ is P-canonical and the rate of convergence is then
given by $(\log \delta^{-1} / n)^{m/2}$. Thus, the new estimator has a performance similar to standard
U-statistics as in (2) and (3) but without the boundedness assumption for the kernel. It
is important to note that a disadvantage of the estimator $\overline{U}_{\mathcal{B}}(h)$ is that it depends on the
confidence level $\delta$ (through the number of blocks). For different confidence levels, different
estimators are used.

Because of its importance in applications, we spell out the special case when $m = q = 2$.
In Section 3 we use this result in an example of cluster analysis.

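Corollary 2. Let $\delta \in (0,1/2)$. Let $h : \mathcal{X}^2 \to \mathbb{R}$ be a P-canonical kernel with
$\sigma^2 = \mathrm{Var}(h(X_1,X_2))$ and let $n \ge 128(1 + \log(1/\delta))$. Then, with probability at least $1 - 2\delta$,

$$|\overline{U}_{\mathcal{B}}(h) - m_h| \le 512\, \sigma\, \frac{1 + \log(1/\delta)}{n} . \tag{8}$$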
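As a quick consistency check, the constant 512 in (8) can be recovered from Theorem 1. The computation below is a sketch that assumes the constant of Theorem 1 is $K_m = 2^{\frac{7}{2}m+1} m^{m/2}$ and that the partition has $|\mathcal{B}| = 32m\lceil \log(1/\delta) \rceil = 64 \lceil \log(1/\delta) \rceil$ blocks for $m = 2$.

```latex
% Constant of Theorem 1 evaluated at m = q = 2:
K_2 = 2^{\frac{7}{2}\cdot 2 + 1}\, 2^{2/2} = 2^{8} \cdot 2 = 512 .
% The sample-size condition of Corollary 2 implies that of Theorem 1:
n \ge 128\bigl(1 + \log(1/\delta)\bigr)
  \;\Longrightarrow\;
  \lceil \log(1/\delta) \rceil \le 1 + \log(1/\delta) \le \frac{n}{128} = \frac{n}{64m} .
% Hence (7) with q = 2 gives (8):
K_2\, \sigma\, \frac{\lceil \log(1/\delta) \rceil}{n}
  \le 512\, \sigma\, \frac{1 + \log(1/\delta)}{n} .
```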


In the proof of Theorem 1 we need the notion of Hoeffding decomposition (Hoeffding,
1948) of U-statistics. For probability measures $P_1,\dots,P_m$, define
$P_1 \times \dots \times P_m h = \int h \, d(P_1,\dots,P_m)$. For a symmetric kernel
$h : \mathcal{X}^m \to \mathbb{R}$ the Hoeffding projections are
defined, for $0 \le k \le m$ and $x_1,\dots,x_k \in \mathcal{X}$, as

$$\pi_k h(x_1,\dots,x_k) := (\delta_{x_1} - P) \times \dots \times (\delta_{x_k} - P) \times P^{m-k} h$$

where $\delta_x$ denotes the Dirac measure at the point x. Observe that $\pi_0 h = P^m h$ and for
$k > 0$, $\pi_k h$ is a P-canonical kernel. h can be decomposed as

$$h(x_1,\dots,x_m) = \sum_{k=0}^{m} \sum_{1 \le i_1 < \dots < i_k \le m} \pi_k h(x_{i_1},\dots,x_{i_k}) . \tag{9}$$

If h is assumed to be square-integrable (i.e., $P^m h^2 < \infty$), the terms in (9) are orthogonal.
If h is degenerate of order $q - 1$, then for any $1 \le k \le q - 1$, $\pi_k h = 0$.

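Proof of Theorem 1. We begin with a “weak” concentration result on each $U_{B_{i_1},\dots,B_{i_m}}(h)$.
Let $B_{i_1},\dots,B_{i_m}$ be elements of $\mathcal{B}$. For any $B \in \mathcal{B}$, we have
$\frac{n}{2|\mathcal{B}|} \le |B| \le \frac{2n}{|\mathcal{B}|}$. We denote by
$\mathbf{k} = (k_1,\dots,k_m)$ an element of $I_{B_{i_1},\dots,B_{i_m}}$. We have, by the above-mentioned orthogonality
property,

$$\begin{aligned}
\mathrm{Var}\left( U_{B_{i_1},\dots,B_{i_m}}(h) \right)
&= \mathbb{E}\left[ \left( U_{B_{i_1},\dots,B_{i_m}}(h) - P^m h \right)^2 \right] \\
&= \frac{1}{|B_{i_1}|^2 \cdots |B_{i_m}|^2} \sum_{\mathbf{k},\mathbf{l} \in I_{B_{i_1},\dots,B_{i_m}}}
   \mathbb{E}\left[ (h(X_{k_1},\dots,X_{k_m}) - P^m h)(h(X_{l_1},\dots,X_{l_m}) - P^m h) \right] \\
&= \frac{1}{|B_{i_1}|^2 \cdots |B_{i_m}|^2} \sum_{\mathbf{k},\mathbf{l} \in I_{B_{i_1},\dots,B_{i_m}}}
   \sum_{s=q}^{|\mathbf{k} \cap \mathbf{l}|} \binom{|\mathbf{k} \cap \mathbf{l}|}{s}
   \mathbb{E}\left[ \pi_s h(X_1,\dots,X_s)^2 \right] \quad \text{(by orthogonality).}
\end{aligned}$$

Counting, for any fixed $\mathbf{k}$ and t, the number of elements $\mathbf{l}$ such that
$|\mathbf{k} \cap \mathbf{l}| = t$, and using the bounds on the block sizes $|B|$ stated above, one obtains

$$\mathrm{Var}\left( U_{B_{i_1},\dots,B_{i_m}}(h) \right) \le \frac{2^{2m-q+1} |\mathcal{B}|^q}{n^q}
  \sum_{s=q}^{m} \binom{m}{s} \mathbb{E}\left[ \pi_s h(X_1,\dots,X_s)^2 \right] .$$

On the other hand, we have, by (9),

$$\begin{aligned}
\mathrm{Var}(h) &= \mathbb{E}\left[ \left( \sum_{s=q}^{m} \sum_{1 \le i_1 < \dots < i_s \le m} \pi_s h(X_{i_1},\dots,X_{i_s}) \right)^2 \right] \\
&= \sum_{s=q}^{m} \sum_{1 \le i_1 < \dots < i_s \le m} \mathbb{E}\left[ \left( \pi_s h(X_{i_1},\dots,X_{i_s}) \right)^2 \right] \\
&= \sum_{s=q}^{m} \binom{m}{s} \mathbb{E}\left[ \left( \pi_s h(X_1,\dots,X_s) \right)^2 \right] .
\end{aligned}$$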
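Combining the two displayed equations above,

$$\mathrm{Var}\left( U_{B_{i_1},\dots,B_{i_m}}(h) \right) \le \frac{2^{2m-q+1} |\mathcal{B}|^q}{n^q} \sigma^2
  \le \frac{2^{2m} |\mathcal{B}|^q}{n^q} \sigma^2 .$$

By Chebyshev’s inequality, for all $r \in (0,1)$,

$$\mathbb{P}\left\{ \left| U_{B_{i_1},\dots,B_{i_m}}(h) - P^m h \right| > 2^m \sigma \frac{|\mathcal{B}|^{q/2}}{n^{q/2} r^{1/2}} \right\} \le r . \tag{10}$$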


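We set $x = 2^m \sigma \frac{|\mathcal{B}|^{q/2}}{n^{q/2} r^{1/2}}$, and

$$N_x = \left| \left\{ (i_1,\dots,i_m) \in \{1,\dots,V\}^m : 1 \le i_1 < \dots < i_m \le V,\ U_{B_{i_1},\dots,B_{i_m}}(h) - P^m h > x \right\} \right| .$$

The random variable $\binom{|\mathcal{B}|}{m}^{-1} N_x$ is a U-statistic of order m with the symmetric kernel
$g : (i_1,\dots,i_m) \mapsto \mathbb{1}\{ U_{B_{i_1},\dots,B_{i_m}}(h) - P^m h > x \}$. Thus, Hoeffding’s inequality for
centered U-statistics (2) gives

$$\mathbb{P}\left\{ N_x - \mathbb{E} N_x \ge t \binom{|\mathcal{B}|}{m} \right\} \le \exp\left( - \frac{|\mathcal{B}| t^2}{2m} \right) . \tag{11}$$

By (10) we have $\mathbb{E} N_x \le \binom{|\mathcal{B}|}{m} r$. Taking $t = r = \frac14$ in (11), by the definition of the median,
we have

$$\mathbb{P}\left\{ \overline{U}_{\mathcal{B}}(h) - P^m h > x \right\}
  \le \mathbb{P}\left\{ N_x \ge \frac12 \binom{|\mathcal{B}|}{m} \right\}
  \le \exp\left( - \frac{|\mathcal{B}|}{32m} \right) .$$

Since $|\mathcal{B}| \ge 32m \log(\delta^{-1})$, with probability at least $1 - \delta$, we have

$$\overline{U}_{\mathcal{B}}(h) - P^m h \le K_m \sigma \left( \frac{\lceil \log \delta^{-1} \rceil}{n} \right)^{q/2}$$

with $K_m = 2^{\frac{7}{2}m+1} m^{\frac{m}{2}}$. The upper bound for the lower tail holds by the same argument. □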


2.2 Bounded moment of order p with $1 < p \le 2$

In this section, we weaken the assumption of finite variance and only assume the existence
of a centered moment of order p for some $1 < p \le 2$. The outline of the argument is similar
to that of the finite variance case. First we obtain a “weak” concentration inequality for
the U-statistic in each block and then use the property of the median to boost the weak
inequality. While for the case of finite variance weak concentration could be proved by a
direct calculation of the variance, here we need the randomization inequalities for convex
functions of U-statistics established by de la Peña (1992) and Arcones and Giné (1993).
Note that, here, a P-canonical technical assumption is needed.

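Theorem 3. Let h be a symmetric kernel of order m such that $h - m_h$ is P-canonical.
Assume that $M_p := \mathbb{E}\left[ |h(X_1,\dots,X_m) - m_h|^p \right]^{1/p} < \infty$ for some $1 < p \le 2$.
Let $\delta \in (0,\tfrac12)$ be such that $\lceil \log(\delta^{-1}) \rceil \le \frac{n}{64m}$. Let $\mathcal{B}$ be a regular
partition of $\{1,\dots,n\}$ with $|\mathcal{B}| = 32m \lceil \log(\delta^{-1}) \rceil$. Then, with probability at least
$1 - 2\delta$, we have

$$|\overline{U}_{\mathcal{B}}(h) - m_h| \le K_m M_p \left( \frac{\lceil \log(\delta^{-1}) \rceil}{n} \right)^{m(p-1)/p} , \tag{12}$$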


where Km = 2 . 

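Proof. Define the centered version of h by $g(x_1,\dots,x_m) := h(x_1,\dots,x_m) - m_h$. Let
$\varepsilon_1,\dots,\varepsilon_n$ be i.i.d. Rademacher random variables (i.e.,
$\mathbb{P}\{\varepsilon_1 = -1\} = \mathbb{P}\{\varepsilon_1 = 1\} = 1/2$)
independent of $X_1,\dots,X_n$. By the randomization inequalities (see Theorem 3.5.3 in
de la Peña and Giné (1999) and also Theorem 8 in the Appendix), we have

$$\begin{aligned}
\mathbb{E}\left[ \Bigl| \sum_{(k_1,\dots,k_m) \in I_{B_{i_1},\dots,B_{i_m}}} g(X_{k_1},\dots,X_{k_m}) \Bigr|^p \right]
&\le 2^{mp}\, \mathbb{E}_X \mathbb{E}_\varepsilon \left[ \Bigl| \sum_{(k_1,\dots,k_m) \in I_{B_{i_1},\dots,B_{i_m}}}
   \varepsilon_{k_1} \cdots \varepsilon_{k_m}\, g(X_{k_1},\dots,X_{k_m}) \Bigr|^p \right] \\
&\le 2^{mp}\, \mathbb{E}_X \left[ \left( \mathbb{E}_\varepsilon \Bigl( \sum_{(k_1,\dots,k_m) \in I_{B_{i_1},\dots,B_{i_m}}}
   \varepsilon_{k_1} \cdots \varepsilon_{k_m}\, g(X_{k_1},\dots,X_{k_m}) \Bigr)^2 \right)^{p/2} \right] \qquad (13) \\
&= 2^{mp}\, \mathbb{E}_X \left[ \Bigl( \sum_{(k_1,\dots,k_m) \in I_{B_{i_1},\dots,B_{i_m}}} g(X_{k_1},\dots,X_{k_m})^2 \Bigr)^{p/2} \right] \\
&\le 2^{mp}\, \mathbb{E}_X \left[ \sum_{(k_1,\dots,k_m) \in I_{B_{i_1},\dots,B_{i_m}}} |g(X_{k_1},\dots,X_{k_m})|^p \right] \\
&= 2^{mp}\, |B_{i_1}| \cdots |B_{i_m}|\, \mathbb{E}|g|^p . \qquad (14)
\end{aligned}$$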
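Thus, we have
$\mathbb{E}\left[ \left| U_{B_{i_1},\dots,B_{i_m}}(h) - m_h \right|^p \right] \le 2^{mp} \left( |B_{i_1}| \cdots |B_{i_m}| \right)^{1-p} \mathbb{E}|g|^p$
and, since $|B| \ge \frac{n}{2|\mathcal{B}|}$ for every block, by Markov’s inequality, for all $r \in (0,1)$,

$$\mathbb{P}\left\{ U_{B_{i_1},\dots,B_{i_m}}(h) - m_h > \frac{2^m M_p}{r^{1/p}} \left( \frac{2|\mathcal{B}|}{n} \right)^{m(p-1)/p} \right\} \le r . \tag{15}$$

Another use of (11) with $t = r = \frac14$ gives, with probability at least $1 - \delta$,

$$\overline{U}_{\mathcal{B}}(h) - P^m h \le K_m M_p \left( \frac{\lceil \log \delta^{-1} \rceil}{n} \right)^{m(p-1)/p} .$$

The lower tail is handled by the same argument. □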


To see why the bound of Theorem 3 gives essentially the right order of magnitude,
consider again the example described in the introduction, when $m = 2$, $h(X_1,X_2) = X_1 X_2$,
and the $X_i$ have an $\alpha$-stable law $S(\gamma,\alpha)$ for some $\gamma > 0$ and $1 < \alpha < 2$. Note that an
$\alpha$-stable random variable has finite moments up to (but not including) $\alpha$ and therefore we
may take $p = \alpha - \epsilon$ for any $\epsilon \in (0, \alpha - 1)$. As we noted in the introduction, there
exists a constant c depending on $\alpha$ and $\gamma$ only such that for all $1 \le i_1 < i_2 \le V$,

$$\mathbb{P}\left\{ \left| U_{B_{i_1},B_{i_2}}(h) - m_h \right| \ge \left( \frac{n}{|\mathcal{B}|} \right)^{2/\alpha - 2} \right\} \ge c$$

and therefore (15) is essentially the best rate one can hope for.


3 Cluster analysis with U-statistics


In this section we illustrate the use of the proposed mean estimator in a clustering problem
when the presence of possibly heavy-tailed data requires robust techniques.

We consider the general statistical framework defined by Clémençon (2014), described
as follows: Let $X, X'$ be i.i.d. random variables taking values in $\mathcal{X}$ (where typically, but not
necessarily, $\mathcal{X}$ is a subset of $\mathbb{R}^d$). For a partition $\mathcal{P}$ of $\mathcal{X}$ into K disjoint sets (the so-called
“cells”), define $\Phi_{\mathcal{P}}(x,x') = \sum_{C \in \mathcal{P}} \mathbb{1}\{(x,x') \in C^2\}$, the $\{0,1\}$-valued function that indicates
whether two elements x and x' belong to the same cell C. Given a dissimilarity measure
$D : \mathcal{X}^2 \to \mathbb{R}_+^*$, the clustering task consists in finding a partition of $\mathcal{X}$ minimizing the
clustering risk

$$W(\mathcal{P}) = \mathbb{E}\left[ D(X,X') \Phi_{\mathcal{P}}(X,X') \right] .$$

Let $\Pi_K$ be a finite class of partitions $\mathcal{P}$ of $\mathcal{X}$ into K cells and define
$W^* = \min_{\mathcal{P} \in \Pi_K} W(\mathcal{P})$.

Given i.i.d. random variables $X_1,\dots,X_n$ distributed as X, the goal is to find a
partition $\mathcal{P} \in \Pi_K$ with risk as close to $W^*$ as possible. A natural idea (and this is the
approach of Clémençon (2014)) is to estimate $W(\mathcal{P})$ by the U-statistic

$$\widehat{W}_n(\mathcal{P}) = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} D(X_i,X_j) \Phi_{\mathcal{P}}(X_i,X_j)$$

and choose a partition minimizing the empirical clustering risk $\widehat{W}_n(\mathcal{P})$. Clémençon (2014)
uses the theory of U-processes to analyze the performance of such minimizers of U-statistics.
However, in order to control uniform deviations of the form
$\sup_{\mathcal{P} \in \Pi_K} |\widehat{W}_n(\mathcal{P}) - W(\mathcal{P})|$,
exponential concentration inequalities are needed for U-statistics. This restricts one to
consider bounded dissimilarity measures $D(X,X')$. When $D(X,X')$ may have a heavy tail,
we propose to replace U-statistics by the median-of-means estimators introduced
in this paper.
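The two risk estimates discussed above can be compared on toy data with the following Python sketch. Everything here is an illustrative assumption: a partition is encoded by a list `labels` giving the cell index of each sample point, `dist` stands for the dissimilarity D, and the function names are hypothetical.

```python
from itertools import combinations
import statistics


def empirical_risk(points, labels, dist):
    """U-statistic estimate of W(P): mean of D * 1{same cell} over pairs i < j."""
    pairs = list(combinations(range(len(points)), 2))
    total = sum(
        dist(points[i], points[j]) * (labels[i] == labels[j]) for i, j in pairs
    )
    return total / len(pairs)


def robust_risk(points, labels, dist, v):
    """Median, over pairs of distinct blocks, of the cross-block averages."""
    # Interleaved blocks of indices; sizes differ by at most one (needs 2 <= v <= n).
    blocks = [list(range(k, len(points), v)) for k in range(v)]

    def cross(a, b):
        vals = [
            dist(points[i], points[j]) * (labels[i] == labels[j])
            for i in a
            for j in b
        ]
        return sum(vals) / len(vals)

    return statistics.median(cross(a, b) for a, b in combinations(blocks, 2))


# Hypothetical one-dimensional example with two cells.
pts = [0.0, 1.0, 2.0, 3.0]
cells = [0, 0, 1, 1]
d = lambda a, b: abs(a - b)
classical = empirical_risk(pts, cells, d)
robust = robust_risk(pts, cells, d, v=2)
```

Minimizing `robust_risk` over a finite list of candidate labelings would then play the role of minimizing the median-of-means risk over the class of partitions.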

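Let $\mathcal{B}$ be a regular partition of $\{1,\dots,n\}$ and define the median-of-means estimator
$\overline{W}_{\mathcal{B}}(\mathcal{P})$ of $W(\mathcal{P})$ as in (6). Then Theorem 1 applies and we have the following simple
corollary.


Corollary 4. Let $\Pi_K$ be a class of partitions of cardinality $|\Pi_K| = N$. Assume that
$\sigma^2 := \mathbb{E}\left[ D(X_1,X_2)^2 \right] < \infty$. Let $\delta \in (0,1/2)$ be such that
$n \ge 128 \lceil \log(N/\delta) \rceil$. Let $\mathcal{B}$ be a regular partition of $\{1,\dots,n\}$ with
$|\mathcal{B}| = 64 \lceil \log(N/\delta) \rceil$. Then there exists a constant C
such that, with probability at least $1 - 2\delta$,

$$\sup_{\mathcal{P} \in \Pi_K} |\overline{W}_{\mathcal{B}}(\mathcal{P}) - W(\mathcal{P})| \le C \sigma \left( \frac{1 + \log(N/\delta)}{n} \right)^{1/2} . \tag{16}$$

Proof. Since $\Phi_{\mathcal{P}}(x,x')$ is bounded by 1,
$\mathrm{Var}\left( D(X_1,X_2) \Phi_{\mathcal{P}}(X_1,X_2) \right) \le \mathbb{E}\left[ D(X_1,X_2)^2 \right]$.
For a fixed $\mathcal{P} \in \Pi_K$, Theorem 1 applies with $m = 2$ and $q = 1$. The inequality follows from
the union bound. □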


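Once uniform deviations of $\overline{W}_{\mathcal{B}}(\mathcal{P})$ from its expected value are controlled, it is a routine
exercise to derive performance bounds for clustering based on minimizing $\overline{W}_{\mathcal{B}}(\mathcal{P})$ over
$\mathcal{P} \in \Pi_K$.

Let $\widehat{\mathcal{P}} = \mathrm{argmin}_{\mathcal{P} \in \Pi_K} \overline{W}_{\mathcal{B}}(\mathcal{P})$ denote the empirical minimizer. (In case of multiple
minimizers, one may select one arbitrarily.) Now for any $\mathcal{P}_0 \in \Pi_K$,

$$\begin{aligned}
W(\widehat{\mathcal{P}}) - W^* &= W(\widehat{\mathcal{P}}) - \overline{W}_{\mathcal{B}}(\widehat{\mathcal{P}}) + \overline{W}_{\mathcal{B}}(\widehat{\mathcal{P}}) - W^* \\
&\le W(\widehat{\mathcal{P}}) - \overline{W}_{\mathcal{B}}(\widehat{\mathcal{P}}) + \overline{W}_{\mathcal{B}}(\mathcal{P}_0) - W(\mathcal{P}_0) + W(\mathcal{P}_0) - W^* \\
&\le 2 \sup_{\mathcal{P} \in \Pi_K} |\overline{W}_{\mathcal{B}}(\mathcal{P}) - W(\mathcal{P})| + W(\mathcal{P}_0) - W^* .
\end{aligned}$$

Taking the infimum over $\Pi_K$,

$$W(\widehat{\mathcal{P}}) - W^* \le 2 \sup_{\mathcal{P} \in \Pi_K} |\overline{W}_{\mathcal{B}}(\mathcal{P}) - W(\mathcal{P})| . \tag{17}$$

Finally, (16) implies that

$$W(\widehat{\mathcal{P}}) - W^* \le 2 C \sigma \left( \frac{1 + \log(N/\delta)}{n} \right)^{1/2} .$$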

This result is to be compared with Theorem 2 of Clémençon (2014). Our result holds under
the only assumption that $D(X,X')$ has a finite second moment. (This may be weakened
to assuming the existence of a finite p-th moment for some $1 < p \le 2$ by using Theorem 3.)
On the other hand, our result holds only for a finite class of partitions while Clémençon
(2014) uses the theory of U-processes to obtain more sophisticated bounds for uniform
deviations over possibly infinite classes of partitions. It remains a challenge to develop a
theory to control processes of median-of-means estimators, in the style of Arcones and Giné
(1993), without having to resort to the use of simple union bounds.

In the rest of this section we show that, under certain “low-noise” assumptions, analogous
to the ones introduced by Mammen and Tsybakov (1999) in the context of classification,
it is possible to obtain faster rates of convergence. In this part we need bounds for P-canonical
kernels and use the full power of Corollary 2. Similar arguments for the study of minimizing
U-statistics appear in Clémençon et al. (2008) and Clémençon (2014).

We assume the following conditions, also considered by Clémençon (2014):


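1. There exists $\mathcal{P}^*$ such that $W(\mathcal{P}^*) = W^*$.

2. There exist $\alpha \in [0,1]$ and $\kappa < \infty$ such that for all $\mathcal{P} \in \Pi_K$ and for all $x \in \mathcal{X}$,

$$\mathbb{P}\left\{ \Phi_{\mathcal{P}}(x,X) \ne \Phi_{\mathcal{P}^*}(x,X) \right\} \le \kappa \left( W(\mathcal{P}) - W^* \right)^{\alpha} .$$

Note that $\alpha \le 2$ since by the Cauchy-Schwarz inequality,

$$W(\mathcal{P}) - W^* \le \mathbb{E}\left[ D(X_1,X_2)^2 \right]^{1/2} \mathbb{P}\left\{ \Phi_{\mathcal{P}}(X_1,X_2) \ne \Phi_{\mathcal{P}^*}(X_1,X_2) \right\}^{1/2} .$$

Corollary 5. Assume the conditions above and that $\sigma^2 := \mathbb{E}\left[ D(X_1,X_2)^2 \right] < \infty$. Let
$\delta \in (0,1/2)$ be such that $n \ge 128 \lceil \log(N/\delta) \rceil$. Let $\mathcal{B}$ be a regular partition of $\{1,\dots,n\}$
with $|\mathcal{B}| = 64 \lceil \log(N/\delta) \rceil$. Then there exists a constant C such that, with probability at
least $1 - 2\delta$,

$$W(\widehat{\mathcal{P}}) - W^* \le C \sigma^{2/(2-\alpha)} \left( \frac{\lceil \log(N/\delta) \rceil}{n} \right)^{1/(2-\alpha)} . \tag{18}$$

The proof of Corollary 5 is postponed to the Appendix.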


4 Appendix 


4.1 Decoupling and randomization 


Here we summarize some of the key tools for analyzing U-statistics that we use in the
paper. For an excellent exposition we refer to de la Peña and Giné (1999).

Let $\{X_i\}$ be i.i.d. random variables taking values in $\mathcal{X}$ and let $\{X_i^k\}$, $k = 1,\dots,m$,
be sequences of independent copies. Let $\Phi$ be a non-negative function. As a corollary of
Theorem 3.1.1 in de la Peña and Giné (1999) we have the following:


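Theorem 6. Let $h : \mathcal{X}^m \to \mathbb{R}$ be a measurable function with $\mathbb{E}|h(X_1,\dots,X_m)| < \infty$. Let
$\Phi : [0,\infty) \to [0,\infty)$ be a convex nondecreasing function such that
$\mathbb{E}\,\Phi(|h(X_1,\dots,X_m)|) < \infty$. Then

$$\mathbb{E}\,\Phi\left( \Bigl| \sum_{I_n^m} h(X_{i_1},\dots,X_{i_m}) \Bigr| \right)
  \le \mathbb{E}\,\Phi\left( C_m \Bigl| \sum_{I_n^m} h(X^1_{i_1},\dots,X^m_{i_m}) \Bigr| \right)$$

where $C_m = 2^m (m^m - 1)((m-1)^{m-1} - 1) \times \dots \times 3$. Moreover, if the kernel h is symmetric,
then

$$\mathbb{E}\,\Phi\left( c_m \Bigl| \sum_{I_n^m} h(X^1_{i_1},\dots,X^m_{i_m}) \Bigr| \right)
  \le \mathbb{E}\,\Phi\left( \Bigl| \sum_{I_n^m} h(X_{i_1},\dots,X_{i_m}) \Bigr| \right)$$

where $c_m = 1/(2^{2m-2}(m-1)!)$.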

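An equivalent result for tail probabilities of U-statistics is the following (see Theorem
3.4.1 in de la Peña and Giné (1999)):


Theorem 7. Under the same hypotheses as Theorem 6, there exists a constant $C_m$
depending on m only such that, for all $t > 0$,

$$\mathbb{P}\left\{ \Bigl| \sum_{I_n^m} h(X_{i_1},\dots,X_{i_m}) \Bigr| > t \right\}
  \le C_m\, \mathbb{P}\left\{ C_m \Bigl| \sum_{I_n^m} h(X^1_{i_1},\dots,X^m_{i_m}) \Bigr| > t \right\} .$$

If, moreover, the kernel h is symmetric then there exists a constant $c_m$ depending on m only
such that, for all $t > 0$,

$$c_m\, \mathbb{P}\left\{ c_m \Bigl| \sum_{I_n^m} h(X^1_{i_1},\dots,X^m_{i_m}) \Bigr| > t \right\}
  \le \mathbb{P}\left\{ \Bigl| \sum_{I_n^m} h(X_{i_1},\dots,X_{i_m}) \Bigr| > t \right\} .$$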


The next theorem is a direct corollary of Theorem 3.5.3 in de la Peña and Giné (1999).


Theorem 8. Let $1 < p \le 2$. Let $(\varepsilon_i)_{i \le n}$ be i.i.d. Rademacher random variables
independent of the $(X_i)_{i \le n}$. Let $h : \mathcal{X}^m \to \mathbb{R}$ be a P-degenerate measurable function such that
$\mathbb{E}(|h(X_1,\dots,X_m)|^p) < \infty$. Then

$$c_m\, \mathbb{E}\Bigl| \sum_{I_n^m} \varepsilon_{i_1} \cdots \varepsilon_{i_m} h(X_{i_1},\dots,X_{i_m}) \Bigr|^p
  \le \mathbb{E}\Bigl| \sum_{I_n^m} h(X_{i_1},\dots,X_{i_m}) \Bigr|^p
  \le C_m\, \mathbb{E}\Bigl| \sum_{I_n^m} \varepsilon_{i_1} \cdots \varepsilon_{i_m} h(X_{i_1},\dots,X_{i_m}) \Bigr|^p$$

where $C_m = 2^{mp}$ and $c_m = 2^{-mp}$.
The same conclusion holds for decoupled U-statistics.


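4.2 α-stable distributions

Proposition 9. Let $\alpha \in (0,2)$. Let $X_1,\dots,X_n$ be i.i.d. random variables of law $S(\gamma,\alpha)$.
Let $f_{\gamma,\alpha} : x \mapsto \mathbb{R}$ be the density function of $X_1$. Let $S_n = \sum_{1 \le i \le n} X_i$. Then

(i) $f_{\gamma,\alpha}(x)$ is an even function.

(ii) $f_{\gamma,\alpha}(x) \sim \alpha \gamma^\alpha c_\alpha x^{-\alpha-1}$ as $x \to +\infty$, with $c_\alpha = \sin\left(\frac{\pi\alpha}{2}\right) \Gamma(\alpha)/\pi$.

(iii) $\mathbb{E}\left[ |X_1|^p \right]$ is finite for any $p < \alpha$ and is infinite whenever $p \ge \alpha$.

(iv) $S_n$ has an $\alpha$-stable law $S(\gamma n^{1/\alpha},\alpha)$.

Proof. (i) and (iv) follow directly from the definition. (ii) is proved in the introduction of
Zolotarev (1986). (iii) is a consequence of (ii). □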


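4.3 Proof of Corollary 5

Define $\Lambda_n(\mathcal{P}) = \widehat{W}_n(\mathcal{P}) - \widehat{W}_n(\mathcal{P}^*)$, the U-statistic based on the sample $X_1,\dots,X_n$, with
symmetric kernel

$$h_{\mathcal{P}}(x,x') = D(x,x') \left( \Phi_{\mathcal{P}}(x,x') - \Phi_{\mathcal{P}^*}(x,x') \right) .$$

We denote by $\Lambda(\mathcal{P}) = W(\mathcal{P}) - W^*$ the expected value of $\Lambda_n(\mathcal{P})$. The main argument in
the following analysis is based on the Hoeffding decomposition. For all partitions $\mathcal{P}$,

$$\Lambda_n(\mathcal{P}) - \Lambda(\mathcal{P}) = 2 L_n(\mathcal{P}) + M_n(\mathcal{P})$$

for $L_n(\mathcal{P}) = \frac{1}{n} \sum_{i \le n} h^{(1)}(X_i)$ with $h^{(1)}(x) = \mathbb{E}[h_{\mathcal{P}}(X,x)] - \Lambda(\mathcal{P})$ and $M_n(\mathcal{P})$ the
U-statistic based on the canonical kernel given by
$h^{(2)}(x,x') = h_{\mathcal{P}}(x,x') - h^{(1)}(x) - h^{(1)}(x') - \Lambda(\mathcal{P})$.
Let $\mathcal{B}$ be a regular partition of $\{1,\dots,n\}$. For any $B \in \mathcal{B}$, $\Lambda_B(\mathcal{P})$ is the
U-statistic on the kernel $h_{\mathcal{P}}$ restricted to the set B and $\overline{\Lambda}_{\mathcal{B}}(\mathcal{P})$ is the median of the sequence
$(\Lambda_B(\mathcal{P}))_{B \in \mathcal{B}}$. We define similarly $L_B(\mathcal{P})$ and $M_B(\mathcal{P})$ on the variables $(X_i)_{i \in B}$. For any
$B \in \mathcal{B}$,

$$\mathrm{Var}(\Lambda_B(\mathcal{P})) = 4 \mathrm{Var}(L_B(\mathcal{P})) + \mathrm{Var}(M_B(\mathcal{P})) .$$

Simple computations show that $\mathrm{Var}(h^{(2)}(X_1,X_2)) = 2 \mathrm{Var}(h^{(1)}(X))$ and therefore,

$$\mathrm{Var}(\Lambda_B(\mathcal{P})) \le \frac{8}{|B|} \mathrm{Var}(h^{(1)}(X)) .$$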
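Moreover

$$\begin{aligned}
\mathrm{Var}\left( h^{(1)}(X) \right)
&\le \mathbb{E}_{X'}\left[ \mathbb{E}_X\left[ h_{\mathcal{P}}(X,X') \right]^2 \right] \\
&\le \mathbb{E}_{X'}\left[ \mathbb{E}_X\left[ D(X,X')^2 \right] \mathbb{E}_X\left[ \left( \Phi_{\mathcal{P}}(X,X') - \Phi_{\mathcal{P}^*}(X,X') \right)^2 \right] \right] \\
&= \mathbb{E}_{X'}\left[ \mathbb{E}_X\left[ D(X,X')^2 \right] \mathbb{P}_X\left\{ \Phi_{\mathcal{P}}(X,X') \ne \Phi_{\mathcal{P}^*}(X,X') \right\} \right] \\
&\le \sigma^2 \kappa \left( W(\mathcal{P}) - W^* \right)^{\alpha}
\end{aligned}$$

where $\mathbb{E}_X$ (resp. $\mathbb{E}_{X'}$) refers to the expectation taken with respect to X (resp. X').
Chebyshev’s inequality gives, for $r \in (0,1)$,

$$\mathbb{P}\left\{ |\Lambda_B(\mathcal{P}) - \Lambda(\mathcal{P})| > \sigma \left( W(\mathcal{P}) - W^* \right)^{\alpha/2} \sqrt{\frac{8\kappa}{|B|\, r}} \right\} \le r .$$

Using again (11) with $r = \frac14$, by $|B| \ge \frac{n}{128 \lceil \log(N/\delta) \rceil}$, there exists a constant C such that for
any $\mathcal{P} \in \Pi_K$, with probability at least $1 - 2\delta/N$,

$$|\overline{\Lambda}_{\mathcal{B}}(\mathcal{P}) - \Lambda(\mathcal{P})| \le C \sigma \left( W(\mathcal{P}) - W^* \right)^{\alpha/2} \sqrt{\frac{\lceil \log(N/\delta) \rceil}{n}} .$$

This implies, by the union bound, that

$$|\overline{W}_{\mathcal{B}}(\widehat{\mathcal{P}}) - W(\widehat{\mathcal{P}})| \le K \sigma \left( W(\widehat{\mathcal{P}}) - W^* \right)^{\alpha/2} \sqrt{\frac{\lceil \log(N/\delta) \rceil}{n}}$$

with probability at least $1 - 2\delta$. Using (17), we obtain

$$\left( W(\widehat{\mathcal{P}}) - W^* \right)^{1 - \alpha/2} \le 2 K \sigma \sqrt{\frac{\lceil \log(N/\delta) \rceil}{n}} ,$$

concluding the proof.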